Preparation release 0.8.1 #1123

Merged: 28 commits merged on Sep 14, 2024

lfoppiano
Collaborator

This PR contains the updates for the release 0.8.1

@coveralls

Coverage Status

coverage: 40.787%. remained the same
when pulling b6a2a20 on release-0.8.1
into 694f0ed on master.

@lfoppiano lfoppiano added this to the 0.8.1 milestone Jun 10, 2024
@coveralls

Coverage Status

coverage: 40.799% (+0.01%) from 40.787%
when pulling 4675511 on release-0.8.1
into 694f0ed on master.

@coveralls

Coverage Status

coverage: 40.787%. remained the same
when pulling f1d703c on release-0.8.1
into 694f0ed on master.

@coveralls

Coverage Status

coverage: 40.787%. remained the same
when pulling c408076 on release-0.8.1
into 694f0ed on master.

@lfoppiano
Collaborator Author

lfoppiano commented Jun 22, 2024

I ran the evaluation with a partial glutton (around 80-90M records from

Since I don't have a GPU machine I can log in to, I:

  1. first ran the extraction using the client + a GPU instance + the partial glutton,
  2. renamed the files from .grobid.tei.xml to .fulltext.tei.xml, and then
  3. ran the evaluation without regenerating the grobid extraction.

Since I did not use the standard method, this should be taken with a pinch of salt.

TLDR: Header metadata and citation context performance have decreased, the rest has increased.

======= Header metadata ======= 

Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0).

======= Strict Matching ======= (exact matches)

===== Field-level results =====

label                accuracy     precision    recall       f1           support

abstract             82.45        16.78        16.48        16.63        1911   
authors              95.68        79.94        79.65        79.79        1941   
first_author         98.93        95.29        94.95        95.12        1941   
keywords             94.22        64.99        63.62        64.3         1380   
title                95.65        80.39        79.52        79.95        1943   

all (micro avg.)     93.39        67.94        67.21        67.57        9116   
all (macro avg.)     93.39        67.48        66.84        67.16        9116   


======== Soft Matching ======== (ignoring punctuation, case and space characters mismatches)

===== Field-level results =====

label                accuracy     precision    recall       f1           support

abstract             92.1         63.83        62.69        63.25        1911   
authors              95.97        81.28        80.99        81.14        1941   
first_author         99.01        95.66        95.31        95.48        1941   
keywords             95.5         73.65        72.1         72.87        1380   
title                97.43        88.87        87.91        88.38        1943   

all (micro avg.)     96           81.2         80.33        80.77        9116   
all (macro avg.)     96           80.66        79.8         80.22        9116   


==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1           support

abstract             97.68        91.05        89.43        90.23        1911   
authors              97.18        87.02        86.71        86.86        1941   
first_author         99.1         96.12        95.78        95.95        1941   
keywords             97.05        84.16        82.39        83.27        1380   
title                98.55        94.17        93.15        93.66        1943   

all (micro avg.)     97.91        90.91        89.93        90.42        9116   
all (macro avg.)     97.91        90.51        89.49        89.99        9116   


= Ratcliff/Obershelp Matching = (Minimum Ratcliff/Obershelp similarity at 0.95)

===== Field-level results =====

label                accuracy     precision    recall       f1           support

abstract             96.88        87.11        85.56        86.33        1911   
authors              96.36        83.14        82.84        82.99        1941   
first_author         98.93        95.29        94.95        95.12        1941   
keywords             96.35        79.42        77.75        78.58        1380   
title                98.15        92.3         91.3         91.8         1943   

all (micro avg.)     97.33        87.97        87.02        87.49        9116   
all (macro avg.)     97.33        87.45        86.48        86.96        9116   

===== Instance-level results =====

Total expected instances:       1943
Total correct instances:        195 (strict) 
Total correct instances:        786 (soft) 
Total correct instances:        1274 (Levenshtein) 
Total correct instances:        1121 (ObservedRatcliffObershelp) 

Instance-level recall:  10.04   (strict) 
Instance-level recall:  40.45   (soft) 
Instance-level recall:  65.57   (Levenshtein) 
Instance-level recall:  57.69   (RatcliffObershelp) 

======= Citation metadata ======= 

Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0).

======= Strict Matching ======= (exact matches)

===== Field-level results =====

label                accuracy     precision    recall       f1           support

authors              97.58        83.04        76.32        79.54        85778  
date                 99.23        94.61        84.26        89.13        87067  
first_author         98.53        89.78        82.5         85.99        85778  
inTitle              96.19        73.23        71.88        72.55        81007  
issue                99.68        91.11        87.76        89.41        16635  
page                 98.61        94.57        83.7         88.81        80501  
title                97.21        79.67        75.31        77.43        80736  
volume               99.44        96.02        89.83        92.82        80067  

all (micro avg.)     98.31        87.22        80.75        83.86        597569 
all (macro avg.)     98.31        87.76        81.44        84.46        597569 


======== Soft Matching ======== (ignoring punctuation, case and space characters mismatches)

===== Field-level results =====

label                accuracy     precision    recall       f1           support

authors              97.65        83.51        76.76        79.99        85778  
date                 99.23        94.61        84.26        89.13        87067  
first_author         98.55        89.95        82.66        86.15        85778  
inTitle              97.85        84.92        83.35        84.13        81007  
issue                99.68        91.11        87.76        89.41        16635  
page                 98.61        94.57        83.7         88.81        80501  
title                98.82        91.44        86.43        88.87        80736  
volume               99.44        96.02        89.83        92.82        80067  

all (micro avg.)     98.73        90.62        83.89        87.13        597569 
all (macro avg.)     98.73        90.77        84.34        87.41        597569 


==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1           support

authors              98.45        89.22        82           85.46        85778  
date                 99.23        94.61        84.26        89.13        87067  
first_author         98.58        90.16        82.85        86.35        85778  
inTitle              98.03        86.18        84.59        85.37        81007  
issue                99.68        91.11        87.76        89.41        16635  
page                 98.61        94.57        83.7         88.81        80501  
title                99.14        93.81        88.66        91.16        80736  
volume               99.44        96.02        89.83        92.82        80067  

all (micro avg.)     98.9         91.97        85.14        88.42        597569 
all (macro avg.)     98.9         91.96        85.46        88.56        597569 


= Ratcliff/Obershelp Matching = (Minimum Ratcliff/Obershelp similarity at 0.95)

===== Field-level results =====

label                accuracy     precision    recall       f1           support

authors              98           85.98        79.03        82.36        85778  
date                 99.23        94.61        84.26        89.13        87067  
first_author         98.53        89.8         82.52        86           85778  
inTitle              97.65        83.5         81.95        82.72        81007  
issue                99.68        91.11        87.76        89.41        16635  
page                 98.61        94.57        83.7         88.81        80501  
title                99.08        93.4         88.28        90.77        80736  
volume               99.44        96.02        89.83        92.82        80067  

all (micro avg.)     98.78        91.02        84.26        87.51        597569 
all (macro avg.)     98.78        91.13        84.67        87.75        597569 

===== Instance-level results =====

Total expected instances:               90125
Total extracted instances:              85898
Total correct instances:                38759 (strict) 
Total correct instances:                50899 (soft) 
Total correct instances:                55786 (Levenshtein) 
Total correct instances:                52324 (RatcliffObershelp) 

Instance-level precision:       45.12 (strict) 
Instance-level precision:       59.26 (soft) 
Instance-level precision:       64.94 (Levenshtein) 
Instance-level precision:       60.91 (RatcliffObershelp) 

Instance-level recall:  43.01   (strict) 
Instance-level recall:  56.48   (soft) 
Instance-level recall:  61.9    (Levenshtein) 
Instance-level recall:  58.06   (RatcliffObershelp) 

Instance-level f-score: 44.04 (strict) 
Instance-level f-score: 57.83 (soft) 
Instance-level f-score: 63.38 (Levenshtein) 
Instance-level f-score: 59.45 (RatcliffObershelp) 

Matching 1 :    68335

Matching 2 :    4155

Matching 3 :    1859

Matching 4 :    662

Total matches : 75011

======= Citation context resolution ======= 

Total expected references:       90125 - 46.38 references per article
Total predicted references:      85898 - 44.21 references per article

Total expected citation contexts:        139835 - 71.97 citation contexts per article
Total predicted citation contexts:       115386 - 59.39 citation contexts per article

Total correct predicted citation contexts:       97290 - 50.07 citation contexts per article
Total wrong predicted citation contexts:         18096 (wrong callout matching, callout missing in NLM, or matching with a bib. ref. not aligned with a bib.ref. in NLM)

Precision citation contexts:     84.32
Recall citation contexts:        69.57
fscore citation contexts:        76.24

======= Fulltext structures ======= 

Evaluation on 1943 random PDF files out of 1941 PDF (ratio 1.0).

======= Strict Matching ======= (exact matches)

===== Field-level results =====

label                accuracy     precision    recall       f1           support

figure_title         96.63        31.47        24.64        27.64        7281   
reference_citation   59.15        57.42        58.68        58.05        134196 
reference_figure     94.74        61.21        65.9         63.47        19330  
reference_table      99.22        83.01        88.39        85.62        7327   
section_title        94.73        76.39        67.76        71.82        27619  
table_title          98.76        57.29        50.29        53.56        3971   

all (micro avg.)     90.54        60.41        60.32        60.36        199724 
all (macro avg.)     90.54        61.13        59.28        60.02        199724 


======== Soft Matching ======== (ignoring punctuation, case and space characters mismatches)

===== Field-level results =====

label                accuracy     precision    recall       f1           support

figure_title         98.52        78.72        61.63        69.13        7281   
reference_citation   61.86        61.68        63.03        62.34        134196 
reference_figure     94.6         61.69        66.41        63.97        19330  
reference_table      99.2         83.19        88.58        85.8         7327   
section_title        95.43        81.25        72.07        76.38        27619  
table_title          99.35        81.87        71.87        76.55        3971   

all (micro avg.)     91.49        65.76        65.67        65.72        199724 
all (macro avg.)     91.49        74.73        70.6         72.36        199724 


====================================================================================
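As a sanity check on the figures in this report, the citation-context precision, recall and f-score can be re-derived from the raw counts printed in the "Citation context resolution" section. The sketch below is not part of the Grobid evaluation code; the class and method names are illustrative, and it assumes the report computes precision as correct/predicted, recall as correct/expected, and the f-score as their harmonic mean.

```java
import java.util.Locale;

/** Re-derives citation-context scores from raw counts (illustrative names, not Grobid code). */
final class CitationContextScores {

    /** Percentage of {@code part} over {@code total}. */
    static double percent(long part, long total) {
        return 100.0 * part / total;
    }

    /** Harmonic mean of precision and recall (both on the same scale). */
    static double f1(double precision, double recall) {
        return 2.0 * precision * recall / (precision + recall);
    }

    /** Formats with two decimals, independent of the default locale. */
    static String fmt(double value) {
        return String.format(Locale.ROOT, "%.2f", value);
    }

    public static void main(String[] args) {
        long expected = 139_835;   // total expected citation contexts
        long predicted = 115_386;  // total predicted citation contexts
        long correct = 97_290;     // total correct predicted citation contexts

        double precision = percent(correct, predicted);
        double recall = percent(correct, expected);

        System.out.println("precision: " + fmt(precision));
        System.out.println("recall:    " + fmt(recall));
        System.out.println("f-score:   " + fmt(f1(precision, recall)));
    }
}
```

Under those assumptions the three values agree with the 84.32 / 69.57 / 76.24 reported above, and predicted minus correct gives the 18096 wrong contexts.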

@lfoppiano
Collaborator Author

I'm attaching all the results as files for completeness:

@kermitt2
Owner

kermitt2 commented Jul 2, 2024

Hi Luca! I think there is a major issue with the JVM version indicated by the Kotlin jvmToolchain:

    kotlin {
        jvmToolchain(17)
    }

The classes and JARs become incompatible with any JVM older than 17, so it is no longer possible to run grobid with a JVM 11:

Error: LinkageError occurred while loading main class org.grobid.trainer.NameAddressTrainer
        java.lang.UnsupportedClassVersionError: org/grobid/trainer/NameAddressTrainer has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 55.0

In addition, it has blocking consequences for other modules and libraries using grobid which can't be run with jvm 17.
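The two numbers in the error message are class file major versions: 61.0 is what a Java 17 compiler emits, and 55.0 is the highest version a Java 11 runtime accepts. A minimal sketch of that mapping follows, assuming the usual offset of 44 between major version and Java release for Java 5 and later; the class and method names are made up for illustration and are not part of Grobid.

```java
/** Maps class file major versions to Java releases (illustrative sketch). */
final class ClassFileVersion {

    /** For Java 5 and later: major version 49 is Java 5, so release = major - 44. */
    static int javaRelease(int majorVersion) {
        return majorVersion - 44;
    }

    /** Reads the major version from the first 8 bytes of a .class file. */
    static int majorVersion(byte[] classBytes) {
        boolean hasMagic = classBytes.length >= 8
                && (classBytes[0] & 0xFF) == 0xCA
                && (classBytes[1] & 0xFF) == 0xFE
                && (classBytes[2] & 0xFF) == 0xBA
                && (classBytes[3] & 0xFF) == 0xBE;
        if (!hasMagic) {
            throw new IllegalArgumentException("Not a class file header");
        }
        // Bytes 4-5 hold the minor version, bytes 6-7 the major version (big-endian).
        return ((classBytes[6] & 0xFF) << 8) | (classBytes[7] & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println("classes compiled for Java " + javaRelease(61));
        System.out.println("runtime accepts up to Java " + javaRelease(55));
        // Highest class file version supported by the JVM running this program.
        System.out.println("this JVM: " + System.getProperty("java.class.version"));
    }
}
```

Running `majorVersion` over the bytes of a class extracted from a built JAR is one way to confirm which target the build actually used.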

The solution seems to be simply to set everything to Java 11:

    kotlin {
        jvmToolchain(11)
    }

although also setting source compatibility to Java 11 is not working:

    sourceCompatibility = 1.11
    targetCompatibility = 1.11

gives

lopez@smallbook:~/grobid$ ./gradlew clean install

FAILURE: Build failed with an exception.

* Where:
Build file '/home/lopez/grobid/build.gradle' line: 268

* What went wrong:
Could not determine the dependencies of task ':grobid-core:shadowJar'.
> The new Java toolchain feature cannot be used at the project level in combination with source and/or target compatibility

@kermitt2
Owner

kermitt2 commented Jul 2, 2024

It seems the Java 11 compatibility is broken by the recent changes in FundingAcknowledgementParser:

./gradlew clean install

> Task :grobid-core:compileJava
/home/lopez/grobid/grobid-core/src/main/java/org/grobid/core/engines/FundingAcknowledgementParser.java:193: error: cannot find symbol
                List<OffsetPosition> annotationsPositionTokens = annotations.stream().map(AnnotatedXMLElement::getOffsetPosition).toList();
                                                                                                                                 ^
  symbol:   method toList()
  location: interface Stream<OffsetPosition>
/home/lopez/grobid/grobid-core/src/main/java/org/grobid/core/engines/FundingAcknowledgementParser.java:253: error: cannot find symbol
            .map(AnnotatedXMLElement::getOffsetPosition).toList());
                                                        ^
  symbol:   method toList()
  location: interface Stream<OffsetPosition>
/home/lopez/grobid/grobid-core/src/main/java/org/grobid/core/engines/FundingAcknowledgementParser.java:259: error: cannot find symbol
                .toList();
                ^
  symbol:   method toList()
  location: interface Stream<OffsetPosition>
/home/lopez/grobid/grobid-core/src/main/java/org/grobid/core/engines/FundingAcknowledgementParser.java:266: error: cannot find symbol
                    .toList();
                    ^
  symbol:   method toList()
  location: interface Stream<Integer>
/home/lopez/grobid/grobid-core/src/main/java/org/grobid/core/engines/FundingAcknowledgementParser.java:294: error: cannot find symbol
                            .toList());
                            ^
  symbol:   method toList()
  location: interface Stream<BoundingBox>
/home/lopez/grobid/grobid-core/src/main/java/org/grobid/core/engines/FundingAcknowledgementParser.java:304: error: cannot find symbol
                        String coordsAsString = String.join(";", postMergeBoxes.stream().map(BoundingBox::toString).toList());
                                                                                                                   ^
  symbol:   method toList()
  location: interface Stream<String>
/home/lopez/grobid/grobid-core/src/main/java/org/grobid/core/engines/FundingAcknowledgementParser.java:372: error: cannot find symbol
                    .toList();
                    ^
  symbol:   method toList()
  location: interface Stream<AnnotatedXMLElement>
/home/lopez/grobid/grobid-core/src/main/java/org/grobid/core/engines/FundingAcknowledgementParser.java:410: error: cannot find symbol
                        .toList();
                        ^
  symbol:   method toList()
  location: interface Stream<AnnotatedXMLElement>

@lfoppiano
Collaborator Author

lfoppiano commented Jul 2, 2024

Hi @kermitt2,
I was going to do that afterwards, with the idea of upgrading to 17 whatever needs to be upgraded.

I checked grobid-quantities, software-mentions and datastet, and they seem to be compatible with JDK 17. I would say that the old modules may stay with an older version.
In any case, I can help you with updating and testing them. Let me know what I can do.

If you want to keep JDK 11 compatibility, for the second problem, you can replace .toList() with .collect(Collectors.toList()).
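For reference, a small sketch of that replacement. `Stream.toList()` was added in Java 16, while `Collectors.toList()` has been available since Java 8, so the second form compiles with JDK 11. The class and method below are hypothetical stand-ins, not code from FundingAcknowledgementParser.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical example of a JDK 11 compatible stream-to-list conversion. */
final class ToListCompat {

    /** Returns the length of each token. */
    static List<Integer> lengths(List<String> tokens) {
        // Java 16+ only: tokens.stream().map(String::length).toList();
        return tokens.stream()
                .map(String::length)
                .collect(Collectors.toList()); // compiles on Java 8 and later
    }

    public static void main(String[] args) {
        System.out.println(lengths(Arrays.asList("funding", "grant", "id")));
    }
}
```

One behavioural difference to keep in mind: `Stream.toList()` returns an unmodifiable list, whereas `Collectors.toList()` makes no guarantee about mutability. Code that relies on immutability could use `Collectors.toUnmodifiableList()`, available since Java 10.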

@kermitt2
Owner

kermitt2 commented Jul 3, 2024

I think it's good to move to JDK 17 in general, but we need to update the other modules first, otherwise this is blocking for users. This is also a general issue for everything that depends on Grobid and for existing production environments where Grobid runs. For example, I am currently stuck, having failed to upgrade entity-fishing from JDK 8 to JDK 11, and this is very annoying for the users.

I think it's better to ensure JDK 11 compatibility for this release - 17 would be a breaking change for version 0.9.0, especially given that the move to 17 is more for our comfort than for any real advantage?

@lfoppiano
Collaborator Author

> I think it's good to move to JDK 17 in general, but we need to update the other modules first, otherwise this is blocking for users. This is also a general issue for everything that depends on Grobid and for existing production environments where Grobid runs. For example, I am currently stuck, having failed to upgrade entity-fishing from JDK 8 to JDK 11, and this is very annoying for the users.

OK, no problem. I might be too optimistic in thinking that people would have migrated to Docker by now.

Let me help you with entity-fishing. Could you commit and push everything you've done so far to a branch of the project? I will have a look ASAP 😉
If there are other modules that need to be updated please do let me know.

> I think it's better to ensure JDK 11 compatibility for this release - 17 would be a breaking change for version 0.9.0, especially given that the move to 17 is more for our comfort than for any real advantage?

Sure. 👍

@lfoppiano
Collaborator Author

@kermitt2 I tested the latest commit 56d351c and it works with JDK 11 on my Apple M2.

@kermitt2
Owner

kermitt2 commented Jul 4, 2024

Thank you very much @lfoppiano, it is now also working for me with JDK 11 on Linux (like you, I usually run JDK 17, which is why I only saw the issue recently).

About entity-fishing, the master has the latest commit if I am not wrong, and running with grobid 0.8.0 and jdk 11 fails because the current version uses an incubator module that has disappeared after jdk 1.8. I did not analyze further which dependency uses this module and if there is a possible replacement in jdk 11.

@coveralls

coveralls commented Jul 17, 2024

Coverage Status

coverage: 40.751% (-0.04%) from 40.787%
when pulling d15e4d2 on release-0.8.1
into 399ef9d on master.

@kermitt2
Owner

I observed the crashes with more PDFs, usually after 10-20K I think, and never got further than 25-30K PDFs.
This happens both with the server running via ./gradlew run and with the Docker image.

When running with gradlew:
JVM 17.0.12 ubuntu build
Linux 6.8.0-40-generic amd64

grobid_client_python with concurrency at 15
grobid service with concurrency unchanged at 10

No crash with jvmToolchain set to JDK 17 after 700K PDFs.

@lfoppiano
Collaborator Author

Thanks @kermitt2 !
I artificially enlarged the set of PDF documents by simply making three copies and merging them under different names. I then tested on around 30K documents, but the JVM did not crash. 😭

I will try again with a larger dataset; I might need a few more days to assemble it. Meanwhile, if you still have the JVM dump somewhere, could you share it?

@lfoppiano
Collaborator Author

I added an additional 40000 unique articles to the previous 30000 and ran again, but could not reproduce the problem. I'm using an 8-vCPU machine with 32 GB of RAM, CRF only, with JDK 17 and the JDK 11 version of the bytecode. 😭

As an alternative, to solve the issue with JDK 11, I could try to run entity-fishing with JDK 17 💦 Are there other modules that require JDK 11?

@kermitt2
Owner

Back to the JVM crash problem:

  • I am running on a machine with Ubuntu 22.04 and only JDK 17 installed (from the Ubuntu packages).
  • Having jvmToolchain set to 11 (as in this current branch), I apparently have a JDK 11 downloaded and used by gradle for building+running the project, which results in the JVM crashes. I include 2 examples of SIGSEGV errors that occur after running Grobid for a while; the compiled method indicated as the crash location usually differs from one crash to another
  • As visible in the error report, Gradle has downloaded a JVM version 11 (JRE version: OpenJDK Runtime Environment Temurin-11.0.23+9), which is not installed on my system
  • When jvmToolchain is set to 17, my installed JVM 17 is used, and there is no crash
  • When removing jvmToolchain from gradle and indicating sourceCompatibility = 1.11, my installed JVM 17 is used, and there is no crash.

The same behavior happens when using command line ./gradlew run or when using a docker image.

More info on javaToolchains as appearing on my system:

:~/grobid$ ./gradlew -q javaToolchains

 + Options
     | Auto-detection:     Enabled
     | Auto-download:      Enabled

 + Eclipse Adoptium JDK 11.0.23+9
     | Location:           /home/lopez/.gradle/jdks/jdk-11.0.23+9
     | Language Version:   11
     | Vendor:             Eclipse Adoptium
     | Is JDK:             true
     | Detected by:        Auto-provisioned by Gradle

 + Ubuntu JDK 17.0.12+7-Ubuntu-1ubuntu222.04
     | Location:           /usr/lib/jvm/java-17-openjdk-amd64
     | Language Version:   17
     | Vendor:             Ubuntu
     | Is JDK:             true
     | Detected by:        Common Linux Locations

 + Invalid toolchains
     + /usr/lib/jvm/openjdk-17
       | Error:              A problem occurred starting process 'command '/usr/lib/jvm/openjdk-17/bin/java''

  • It could be that this downloaded JDK 11 is not compatible with this Ubuntu system, following this issue:
    A fatal error has been detected by the Java Runtime Environment (adoptium/adoptium-support#1156)

  • With the current setting, I guess despite the openjdk:17-jdk-slim base image, it will be a JDK 11 that will be used both for build and run, downloaded by gradle when building the image. So this makes the JDK of the base image useless. I suspect it also creates this issue when running on some systems that do not like this JDK 11 selected and downloaded by gradle javaToolchain.

  • For this release, we could maybe remove jvmToolchain from gradle? Kotlin is just used for testing?

logs-error-2.txt
logs-error-1.txt

@lfoppiano
Collaborator Author

Hi @kermitt2, thanks again, this indeed helps to understand the problem better. In my test I had JDK 11.0.24, which was automatically downloaded by gradle.

Anyway, I pushed a small change that should solve the issue and allow us to keep everything 🤞, in brief:

  • in gradle.properties I added a flag to prevent gradle from downloading any JDK automatically (with the Java toolchain, this will break the build if toolchain=11 and the system JDK is 17),
  • removed any trace of jvmToolchain, and reverted to the old working style sourceCompatibility/targetCompatibility = 1.11
  • added a section to build the kotlin stuff without using the jvmToolchain, setting JDK 11 there as well

Regarding the observation with docker, in principle we don't use gradle to run the service, so I'm not sure why it crashes... 🤔

@lfoppiano
Collaborator Author

I ran grobid natively with gradle, built with the latest commits on this branch, on ~70000 documents using JDK 17.0.12 and JDK 11.0.24 (installed with Ubuntu 22.04).
I saw no issue with either of them. Maybe the issue was specific to JDK 11.0.23?

@lfoppiano
Collaborator Author

I also tested the docker image resulting from my last change and it did not crash.
I also investigated why we get the crash when mixing JDK 17/11 in docker, but I cannot find an answer, because the only JDK available is 17 and it is the one used by the script (we use the distribution script rather than gradle run). Anyway, as you previously mentioned, @kermitt2, we can still ship a JDK 17/17 version with docker.

@lfoppiano
Collaborator Author

For version 0.8.1 I have set up the infrastructure so that I can reproduce the same end-to-end evaluation results :-)
For running it on Linux natively with DL and conda, I needed to use the branch #1010 (not to be added in this release)

@kermitt2
Owner

I ran some tests with the updated version, without jvmToolchain and without automatic JVM download, and I no longer had any problem. So with sourceCompatibility/targetCompatibility = 1.11 both my local Ubuntu JDK 17 and 11 work fine on large volumes of PDFs.

The problem, I think, was related to the build of the JDK downloaded by jvmToolchain. It was a JDK 11 distribution from Eclipse (OpenJDK Runtime Environment Temurin-11.0.23+9), while normally we should use the Ubuntu packaged one for safety. It means jvmToolchain might not be reliable in the future, because it might download a JDK built independently from the Linux distribution instead of the one specifically built for the Linux distribution in use.

For the docker image, I suppose the Grobid project was built with the downloaded JDK 11 (in the first build layer), then Ubuntu JRE 17 from the base image was used in the runtime, so possible clash of JDK here.
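When a build-time JDK and a run-time JDK may differ like this, it can help to log which JVM is actually executing the service. The snippet below is only a sketch with illustrative names, not something Grobid provides; it relies on standard system properties and on `Runtime.version()`, whose `feature()` method needs Java 10 or later.

```java
/** Prints the vendor, version and location of the JVM running this code (illustrative sketch). */
final class RuntimeJdkInfo {

    /** One-line description built from standard system properties. */
    static String describe() {
        return String.format("%s %s (%s) at %s",
                System.getProperty("java.vendor"),
                System.getProperty("java.version"),
                System.getProperty("java.vm.name"),
                System.getProperty("java.home"));
    }

    /** Feature release of the running JVM, e.g. 11 or 17 (requires Java 10+). */
    static int featureVersion() {
        return Runtime.version().feature();
    }

    public static void main(String[] args) {
        System.out.println(describe());
        System.out.println("feature version: " + featureVersion());
    }
}
```

Comparing the `java.home` printed here with the locations listed by `./gradlew -q javaToolchains` would show whether an auto-provisioned JDK or the system one is in use.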

I think we're good for the release ? :)

@lfoppiano
Collaborator Author

Great!!!!!

I can take care of the release, leaving to you only double checking it? 😄

@lfoppiano lfoppiano merged commit 4cad850 into master Sep 14, 2024
10 of 12 checks passed
@lfoppiano
Collaborator Author

Grobid

  • Jars ✅
  • Docker CRF ✅
  • Docker Full ✅ (lfoppiano/grobid:0.8.1-full, but can be re-tagged quickly, see below)

The docker images were built with github actions. I just re-tagged it accordingly. You can save time for build by re-tagging the full image and push it under grobid:

  docker pull lfoppiano/grobid:0.8.1-full
  docker tag lfoppiano/grobid:0.8.1-full grobid/grobid:0.8.1-full
  docker push grobid/grobid:0.8.1-full

Grobid modules

Here is the list of grobid modules. I did not include the ones that are old, as it's hard to maintain everything. @kermitt2 feel free to add others if there are any.

  • Pub2TEI ✅
  • DataStet (updated the DataSeer's version only) ✅
  • Grobid quantities ✅
  • Software Mentions
  • Entity-fishing

Since I cannot control the S3 repository, I usually ship the JARs with the repository as flat dependencies. This requires specifying all the dependencies, but I don't know of anything better.

@kermitt2 do you want me to update Software Mentions and Entity-fishing as well?

@kermitt2
Owner

@lfoppiano all the artifacts for 0.8.1 have been published on https://grobid.s3.eu-west-1.amazonaws.com/repo
I don't have any errors when building from the DIY repo. Can you give me some info about your errors?

@kermitt2
Owner

@lfoppiano I'll update software-mentions, entity-fishing, DataStet sure

@lfoppiano
Collaborator Author

I don't have any particular error, but if I decide to move to a SNAPSHOT version for development I will need to ship the JARs in my repo anyway.

OK. For DataStet I've updated the DataSeer branch (https://github.com/DataSeer/datastet) because I don't have access to your repository. I'm not sure whether I have already pushed any PRs.

@kermitt2
Owner

> I don't have any particular error, but if I decide to move to a SNAPSHOT version for development I will need to ship the JARs in my repo anyway.

Does it mean it is working? You normally have snapshot versions in your local maven repo for development. This DIY stuff is anyway more for Java clients, but you should never need a localLibs/grobid-core-0.8.1.jar added to a project, no?

@lfoppiano
Collaborator Author

lfoppiano commented Sep 14, 2024

Yes it works. :-)

For grobid-quantities and grobid-superconductors I do ship the JARs in the repo. In this case, grobid-superconductors also ships the grobid-quantities JAR.

@lfoppiano lfoppiano deleted the release-0.8.1 branch September 14, 2024 12:34
@lfoppiano
Collaborator Author

@kermitt2 for DataStet I've implemented a few useful things: 1) TEI processing, 2) parallel processing for DataSeerML (I know it's obsolete, but it was needed at DataSeer), and 3) a refactoring of the build using the grobid-full image.

I will send a couple of PRs next week. Would be good to have a review (without rush) so that I can consolidate my knowledge on the application for the BSO project 😄

@kermitt2
Owner

@lfoppiano So on my side, I have updated software-mentions, grobid-ner, DataStet (standard), entity-fishing, grobid demo on HuggingFace.

I will study the PR for DataStet carefully because processing a TEI is likely very complicated. Great addition I think.

I notice that the Docker image for Grobid is 2 GB larger than before (compressed size) with 0.8.1. Not that it is a problem I think, but any particular reasons?

@lfoppiano
Collaborator Author

Great thanks!

The 0.8.1 image that I built via github actions is 10.92 GB (compressed); version 0.8.0 was approximately 10.5. 🤔 It might be something to do with your build (maybe you used a source with additional models that have been included?).

@kermitt2
Owner

> It might be something to do with your build (maybe you used a source with additional models that have been included?).

Ah yes sorry, this is exactly what happened :D
